Abstract

Voids exist in proteins as packing defects and are often associated with pro- 
tein functions. We study the statistical geometry of voids in two-dimensional 
lattice chain polymers. We define voids as topological features and develop 
a simple algorithm for their detection. For short chains, void geometry is 
examined by enumerating all conformations. For long chains, the space of 
void geometry is explored using sequential Monte Carlo importance sampling 
and resampling techniques. We characterize the relationship of geometric 
properties of voids with chain length, including probability of void formation, 
expected number of voids, void size, and wall size of voids. We formalize 
the concept of packing density for lattice polymers, and further study the 
relationship between packing density and compactness, two parameters fre- 
quently used to describe protein packing. We find that both fully extended 
and maximally compact polymers have the highest packing density, but poly- 
mers with intermediate compactness have low packing density. To study the 
conformational entropic effects of void formation, we characterize the conformation
reduction factor of void formation and find that there is a strong end-effect:
voids are more likely to form at the chain end. The critical exponent of the
end-effect is twice as large as that of self-contacting loop formation
when existence of voids is not required. We also briefly discuss the sequential
Monte Carlo sampling and resampling techniques used in this study.

I. INTRODUCTION



Soluble proteins are well-packed; their packing densities may be as high as that of
crystalline solids. Yet there are numerous packing defects or voids in protein structures,
whose size distributions are broad. The volume ($v$) and area ($a$) of proteins do not scale
as $v \sim a^{3/2}$, which would be expected for models of tight packing. Rather, $v$ and $a$
scale linearly with each other. In addition, the scaling of protein volume with cluster radius
is characteristic of random sphere packing. Such scaling behavior indicates that the interior
of proteins is more like Swiss cheese with many holes than a tightly packed jigsaw puzzle.

What effects do voids have? Proteins are often very tolerant to mutations, which
may suggest potentially stabilizing roles of voids in proteins. Voids in proteins are also often
associated with protein function. The binding sites of proteins for substrate catalysis and
ligand interactions are frequently prominent voids and pockets on protein structures.
However, the energetic and kinetic effects of maintaining specific voids in proteins are not
well understood, and the shape space of voids of folded and unfolded proteins is largely
unknown.

In this paper, we examine the details of the statistical nature of voids in simple lattice
polymers. Lattice models have been widely used for studying protein folding, where the
conformational space of simplified polymers can be examined in detail. Despite their
simplistic nature, lattice models have provided important insights about proteins, including
collapse and folding transitions, the influence of packing on secondary structure
formation [12], and the designability of lattice structures. However, one drawback is
that lattice models have not been well suited for studying void-related structural features,
such as protein functional sites, since it is not easy to model the geometry of voids.

In this article, we first define voids as topological defects and describe a simple algorithm 
for void detection on a two-dimensional lattice. We then exhaustively enumerate the
conformations for all n-polymers up to n = 22, and analyze the relationship of probability of void
formation, expected packing density and compactness, as well as expected wall interval of 
void with chain length. To study statistical geometry of long chain polymers, we describe 
a Monte Carlo sampling strategy under the framework of Sequential Importance Sampling, 
and introduce the technique of resampling. The results of simulation of long chain polymers 
up to n = 200 for several geometric parameters are then presented. We further explore the
conformation reduction factor R of void formation, and describe the significant end-effect 
of void formation, as well as the scaling law of R and wall interval of voids. In the final 
section, we summarize our results and discuss effective sampling strategy for studying the 
conformational space of voids. 



II. LATTICE MODEL AND VOIDS 

Lattice polymers are self-avoiding walks (SAWs), which can be obtained from a chain-growth
model. Specifically, an n-polymer $P$ on a two-dimensional square lattice
is formed by monomers $n_i$, $i \in \{1, \dots, n\}$. The location $x_i$ of a monomer $n_i$ is
defined by its coordinates $x_i = (a_i, b_i)$, where $a_i$ and $b_i$ are integers. The monomers
are connected as a chain, and the distance between bonded monomers $x_i$ and $x_{i+1}$ is 1.
The chain is self-avoiding: $x_i \neq x_j$ for all $i \neq j$. We consider the beginning and the
end of a polymer to be distinct. Only conformations that are not related by translation,
rotation, and reflection are considered to be distinct. This is achieved by following the rule
that a chain is always grown from the origin, the first step is always to the right, and the
chain always goes up the first time it deviates from the x-axis. For a chain polymer, two
non-bonded monomers $n_i$ and $n_j$ are in topological contact if they intersect at an edge
that they share. If two monomers share a vertex of a square but not an edge, these two
monomers are defined as not in contact.
(Figure 1 here.)
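
The counting rule described above can be illustrated with a short enumeration sketch. This is not the authors' enumeration code; the function name `count_conformations` and its recursive structure are assumptions made for illustration. It fixes the first monomer at the origin, the second monomer one step to the right, and forbids a downward step until the chain has left the x-axis, so that each class of conformations related by translation, rotation, or reflection is counted once.

```python
def count_conformations(n):
    """Count n-monomer self-avoiding chains on the square lattice, treating
    chains related by translation, rotation, or reflection as one conformation.
    Illustrative sketch only; feasible for short chains."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return 1
    moves = [(1, 0), (0, 1), (-1, 0), (0, -1)]

    def grow(x, y, occupied, remaining, off_axis):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in moves:
            # Until the chain leaves the x-axis, its first deviation must be upward.
            if not off_axis and dy == -1:
                continue
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue
            occupied.add(nxt)
            total += grow(nxt[0], nxt[1], occupied, remaining - 1,
                          off_axis or dy != 0)
            occupied.remove(nxt)
        return total

    # Monomer 1 sits at the origin; monomer 2 is always one step to the right.
    return grow(1, 0, {(0, 0), (1, 0)}, n - 2, False)
```

The running time grows exponentially with n, which is why enumeration of this kind is only practical for short chains.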

When the number of monomers is 8 or more, a polymer may contain one or more voids
(Figure 1a). We define voids as topological features of the polymer. The complement space
$\mathbb{R}^2 - P$ that is not occupied by the polymer $P$ can be partitioned into disjoint components:

$$\mathbb{R}^2 - P = V_0 \cup V_1 \cup \dots \cup V_k.$$

Here $V_0$ is the unique component of the complement space that extends to infinity. We
call this the outside. The remaining components, which are disconnected from each
other, are the voids of the polymer. Because non-bonded monomers intersecting at a vertex are
defined as not in contact, they do not break up the complement space. As an example, the
unfilled space contained within the polymer in Figure 1b is regarded as one connected void
of size 4 rather than two disjoint voids of size 2. A simple algorithm for void detection can
be found in the Appendix. Figure 2 shows the only six conformations among all 301,100,754
conformations of the 22-mer found to have 4 voids.
(Figure 2 here.)
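
As a rough illustration of this definition (and not a reproduction of the algorithm in the Appendix), the sketch below labels the connected components of the unoccupied sites inside a padded bounding box and reports those that do not reach the padding. The function name `find_voids` is hypothetical. Empty sites are joined through shared vertices as well as shared edges, which is one way to encode the statement that monomers touching only at a vertex do not break up the complement space.

```python
from collections import deque

def find_voids(chain):
    """Return the sorted sizes of all voids enclosed by a lattice chain.
    `chain` is a list of (x, y) integer sites occupied by the monomers."""
    occupied = set(chain)
    xs = [x for x, _ in chain]
    ys = [y for _, y in chain]
    # Pad the bounding box by one site; the padding ring belongs to the outside.
    xmin, xmax = min(xs) - 1, max(xs) + 1
    ymin, ymax = min(ys) - 1, max(ys) + 1
    # Empty sites sharing an edge or only a vertex are treated as connected.
    neighbors = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                 if (dx, dy) != (0, 0)]
    seen = set()
    sizes = []
    for x in range(xmin, xmax + 1):
        for y in range(ymin, ymax + 1):
            start = (x, y)
            if start in occupied or start in seen:
                continue
            queue = deque([start])
            seen.add(start)
            size = 0
            touches_outside = False
            while queue:
                cx, cy = queue.popleft()
                size += 1
                if cx in (xmin, xmax) or cy in (ymin, ymax):
                    touches_outside = True
                for dx, dy in neighbors:
                    nb = (cx + dx, cy + dy)
                    if (xmin <= nb[0] <= xmax and ymin <= nb[1] <= ymax
                            and nb not in occupied and nb not in seen):
                        seen.add(nb)
                        queue.append(nb)
            if not touches_outside:
                sizes.append(size)
    return sorted(sizes)
```

For the 8-monomer ring around a single empty site this returns one void of size 1, whereas a 7-monomer chain that leaves one corner of the ring open returns no void, in line with the statement that at least 8 monomers are needed.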



III. VOIDS DISTRIBUTION BY EXACT ENUMERATION 

Probability of Forming Voids and Expected Number of Voids. The number of conformations
$\omega(n)$ of n-polymers up to n = 25, obtained by exhaustive enumeration, is shown in Table I.
The numbers of conformations for polymers up to n = 15 are in exact agreement with
those reported by Chan and Dill [12]. Table I also lists the number of conformations $\omega_k(n)$
containing k = 1, 2, 3, or 4 voids.

The probability $\pi_v$ for a polymer to form one or more voids is calculated as:

$$\pi_v = \frac{\sum_{k \ge 1} \omega_k(n)}{\omega(n)}.$$

The expected number of voids $\bar{n}_v$ for a polymer is:

$$\bar{n}_v = \frac{\sum_{k \ge 1} k \, \omega_k(n)}{\omega(n)}.$$

As the chain length grows, it is clear that both $\pi_v$ and $\bar{n}_v$ increase (Figure ^ and Figure ^d).
(Figure |^ here.)

Void Size. The total size $v$ of voids in a polymer is the sum of the sizes of all voids,
namely, the total number of all unoccupied squares that are fully contained within the
polymer. Let $\omega_v(n)$ be the number of conformations of n-polymers with total void size $v$.
The expected total void size $\bar{v}$ for n-polymers is:

$$\bar{v} = \frac{\sum_{v} v \, \omega_v(n)}{\omega(n)}.$$

Figure ^ shows that the expected void size $\bar{v}$ increases with chain length n.

Wall Size of Void. For a void $V$ of size $v$, what is the required minimum length $l(v)$
for a polymer that can form such a void? Equivalently, what is the size of the wall of the
polymer containing void $V$? Here we first restrict our discussion to voids formed only by
strongly connected unoccupied sites, namely, any two neighboring sites of a void must
share at least one edge of the squares. We exclude voids containing weakly connected
sites, where two neighboring sites are connected by only one shared vertex (Figure 1b). For
$v$ = 1, 2 and 3, it is easy to see from the geometry of the voids that $l(v)$ = 8, 10 and 12,
respectively. However, in general $l(v)$ also depends on the shape of the void. A void of size
4 can have five different shapes. If the void has the shape of a 2 x 2 square, $l(4)$ = 12. For
the other four shapes, $l(4)$ = 14.

For any strongly connected void, we find that the following general recurrence relationship
for $l(v)$ holds:

$$l(v) = l(v-1) + \begin{cases} 2, & \text{if } \Delta \partial V = 3 \\ 0, & \text{if } \Delta \partial V = 2 \\ -1, & \text{if } \Delta \partial V = 1, \end{cases}$$

where $\partial V$ represents the boundary edges of void $V$, and $\Delta \partial V$ represents the net gain in the
number of boundary edges introduced by the newly added unoccupied site. Although the
number and explicit shapes of strongly connected voids of size up to 5 are known,
there is no general analytical formula known for the number of shapes of a void of size $v$.
This is related to the problem of determining the number of polyominos or animals (as in
percolation theory) of a given size.
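
As a consistency check, the recurrence can be applied step by step to the small voids quoted above. The reading of $\Delta \partial V$ used here, namely the number of edges of the newly added site that lie on the void boundary, is an interpretation and not a statement from the text; under it, a site attached along one edge gives $\Delta \partial V = 3$ and a site attached along two edges gives $\Delta \partial V = 2$.

$$\begin{aligned}
l(1) &= 8, \\
l(2) &= l(1) + 2 = 10 && (\Delta \partial V = 3), \\
l(3) &= l(2) + 2 = 12 && (\Delta \partial V = 3), \\
l(4) &= l(3) + 0 = 12 && (\Delta \partial V = 2,\ \text{2 x 2 square}), \\
l(4) &= l(3) + 2 = 14 && (\Delta \partial V = 3,\ \text{the other four shapes}).
\end{aligned}$$

These values agree with $l(v)$ = 8, 10, 12 for $v$ = 1, 2, 3 and with the two values of $l(4)$ given above.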

When weakly connected voids are also considered, there are more possible wall sizes for
a void. For the 22-mer, the number of different wall sizes observed for a void, strongly or weakly
connected, at various sizes is shown in Figure §a. Voids of size 5 have the largest diversity in
wall size. This is of course due to the fixed chain length: a short chain such as the 22-mer
has only a small number of ways to form large voids. Figure §3 shows the average wall
size for various void sizes in the 22-mer. The expected or average wall size $\bar{w}(n)$ for a void in
an n-polymer can be calculated as:

$$\bar{w}(n) = \sum_{v, w} w \, \frac{\omega_{v,w}(n)}{\omega_v(n)},$$

where $v$ is the void size, $w$ the wall size of the void, $\omega_{v,w}(n)$ is the number of n-polymers
containing a void of size $v$ with wall size $w$, and $\omega_v(n)$ is the total number of n-polymers
with a void of size $v$. Figure ^ shows that $\bar{w}(n)$ increases with chain length. Wall size and
void size are analogous to the area and volume of voids in three-dimensional space.

Packing Density. An important parameter that describes how effectively atoms fill
space is the packing density $p$. In proteins, it is defined by Richards and colleagues as the
amount of space that is occupied within the van der Waals envelope of the molecule,
divided by the total volume of space that contains the molecule. It has been widely
used by protein chemists as a parameter for characterizing protein folding. Following
this original definition, the packing density $p$ for a lattice polymer is:

$$p = \frac{n}{n + v},$$

when an n-polymer has a total void size of $v$.

The expected packing density $\bar{p}(n)$ for an n-polymer can be calculated as:

$$\bar{p}(n) = \frac{\sum_{p} p \, \omega_p(n)}{\omega(n)},$$

where $\omega(n)$ is the number of all conformations of the n-mer, and $\omega_p(n)$ the number of n-mers with
packing density $p$. The expected packing density $\bar{p}(n)$ decreases roughly linearly with the chain
length n between n = 7 and n = 22 (Figure |]a). Because it takes at least two additional monomers
to increase the size of a void by one, $\bar{p}(n)$ decreases only when n is an odd number for short
chains.

Although voids are packing defects, most conformations with voids have high packing
density; namely, the total size of their voids is small. Among all conformations of the 22-mer
containing one void, the number of conformations increases monotonically with packing density. The
lowest packing density 0.52 has only 11 conformations, whereas the highest packing density
0.92 has the largest number (6,756,751) of conformations (Figure ^). A similar relationship
is found among conformations with 2 and 3 voids (Figure ^c).

Compactness. Another important parameter that measures packing of lattice polymers
is the number of nonbonded contacts $t$. It is related to the compactness parameter $\rho$, defined
in [12] as $\rho = t / t_{max}$, where $t_{max}$ is the maximum number of nonbonded contacts possible for an
n-polymer. Compactness $\rho$ has been studied extensively in seminal works by Chan and Dill
[12, 23]. Although the packing density $p$ is sometimes correlated with the compactness $\rho$, these
two parameters are distinct. The relationship between compactness and expected packing density for chain
polymers of length 14 to 22 is shown in Figure 5. For all chain lengths, both maximally
compact polymers ($\rho = 1$) and extended polymers ($\rho = 0$) have maximal packing density
($p = 1$). Polymers with $\rho$ between 0.4 and 0.6 have the lowest packing density and therefore
tend to have larger void size. The explanation is simple. An extended lattice chain polymer
has no voids; it therefore achieves the maximal packing density of $p = 1$, but its compactness $\rho$
is 0. A maximally compact polymer with $\rho = 1$ also contains no voids, so its $p$ is 1. On the
other hand, non-maximally compact polymers can have a range of packing densities.
(Figure 5 here.)
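
The contact count $t$ that enters the compactness can be computed directly from the monomer coordinates. The following is a minimal sketch; the function names `count_contacts` and `compactness` are assumptions, and the value of $t_{max}$ is not derived here but must be supplied by the caller for the chain length in question.

```python
def count_contacts(chain):
    """Number of nonbonded topological contacts t: pairs of monomers that
    share an edge on the lattice but are not adjacent along the chain."""
    index_of = {site: i for i, site in enumerate(chain)}
    t = 0
    for i, (x, y) in enumerate(chain):
        # Look only right and up so that each neighboring pair is visited once.
        for nb in ((x + 1, y), (x, y + 1)):
            j = index_of.get(nb)
            if j is not None and abs(i - j) > 1:
                t += 1
    return t

def compactness(chain, t_max):
    """Compactness rho = t / t_max; t_max is supplied by the caller."""
    return count_contacts(chain) / t_max
```

For example, a fully extended chain gives $t = 0$, and a 9-monomer spiral filling a 3 x 3 block gives $t = 4$.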



IV. OBTAINING VOID STATISTICS FOR LONG CHAINS VIA IMPORTANCE SAMPLING

Sequential Importance Sampling. Geometrically complex and interesting features emerge
only in polymers of sufficient length, which are not accessible for analysis by exhaustive
enumeration because the number of possible SAWs increases exponentially with
the chain length. Monte Carlo methods are often used to generate samples from all possible
conformations and to obtain estimates of feature statistics using those samples. However,
when the chain length becomes large, the direct generation of SAWs from the uniform
distribution of all possible SAWs using the rejection method (i.e., generating random walks on
the lattice and accepting only those that are self-avoiding) becomes difficult. The success rate $s_N$
of generating SAWs decreases exponentially: $s_N \sim c_N / (4 \times 3^{N-1})$, where $c_N$ denotes the
number of SAWs of length N. For N = 48, $s_N$ is only 0.79%. To overcome this attrition problem,
a widely used approach is the Rosenbluth Monte Carlo method of biased sampling. The task is to
grow one more monomer for a t-polymer chain that has been successfully grown from 1 monomer
after t - 1 successive steps without self-crossing, until t = n, the targeted chain length. In this
method, the placement of the (t+1)-th monomer is determined by the current conformation of the
polymer. If there are $n_t$ unoccupied neighbors for the t-th monomer, we then randomly (with
equal probability) set the (t+1)-th monomer to any one of the $n_t$ sites. However, the resulting
sample is biased toward more compact conformations and does not follow the uniform distribution.
Hence each sample is assigned a "weight" to adjust for the bias. Any statistic can then
be obtained from a weighted average of the samples. In the case of the Rosenbluth chain growth
method, the weight is computed recursively as $w_t = n_t w_{t-1}$.
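
The Rosenbluth growth rule and its recursive weight can be sketched as follows. This is a minimal illustration rather than the code used in this study; the names `rosenbluth_chain` and `weighted_mean` are hypothetical, and the sketch grows chains in all four directions without applying the symmetry rule of Section II. A chain that runs into a dead end is returned with weight 0.

```python
import random

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def rosenbluth_chain(n, rng):
    """Grow one n-monomer chain by the Rosenbluth rule.
    Returns (chain, weight); the weight is 0.0 if the chain dies."""
    chain = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    for _ in range(n - 1):
        x, y = chain[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in occupied]
        if not free:
            return chain, 0.0          # dead end: no unoccupied neighbor
        weight *= len(free)            # w_t = n_t * w_{t-1}
        nxt = rng.choice(free)         # uniform choice among the n_t free sites
        chain.append(nxt)
        occupied.add(nxt)
    return chain, weight

def weighted_mean(samples, h):
    """Weighted average of h over (chain, weight) samples."""
    total = sum(w for _, w in samples)
    return sum(w * h(c) for c, w in samples) / total
```

A useful sanity check is that the plain average of the weights estimates the number of SAWs: for chains of 5 monomers (4 steps from a fixed origin) it should be close to 100.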

Liu and Chen provided a general framework of Sequential Monte Carlo (SMC) methods
which extends the Rosenbluth method to more general settings. More sophisticated,
flexible, and effective algorithms can be developed under this framework. In the context of
growing polymers, SMC can be formulated as follows. Let $(x_1, \dots, x_t)$ be the positions of the
t monomers in a chain of length t. Let $\pi_1(x_1), \pi_2(x_1, x_2), \dots, \pi_t(x_1, \dots, x_t)$ be a sequence of
target distributions, with $\pi(x_1, \dots, x_n) = \pi_n(x_1, \dots, x_n)$ being the final objective
distribution from which we wish to draw inference. Let $g_{t+1}(x_{t+1} \mid x_1, \dots, x_t)$ be a sequence of
trial distributions which dictates the growing of the polymer. Then we have:

Procedure SMC (n)

    Draw x_1^(j), j = 1, ..., m from g_1(x_1)
    Set the incremental weight w_1^(j) = pi_1(x_1^(j)) / g_1(x_1^(j))
    for t = 1 to n - 1
        for j = 1 to m
            // Sampling for the (t+1)-th monomer for the j-th sample
            Draw position x_{t+1}^(j) from
                g_{t+1}(x_{t+1} | x_1^(j) ... x_t^(j))
            // Compute the incremental weight.
            u_{t+1}^(j) <- pi_{t+1}(x_1^(j) ... x_{t+1}^(j))
                           / [ pi_t(x_1^(j) ... x_t^(j)) * g_{t+1}(x_{t+1}^(j) | x_1^(j) ... x_t^(j)) ]
            w_{t+1}^(j) <- u_{t+1}^(j) * w_t^(j)
        endfor
        Resampling
    endfor

At the end, the configurations of successfully generated polymers $\{(x_1^{(j)}, \dots, x_n^{(j)})\}_{j=1}^{m}$ and
their associated weights $\{w_n^{(j)}\}_{j=1}^{m}$ can be used to estimate any properties of the polymers,
such as expected void size, compactness, and packing density. That is, the objective inference
$\mu_h = E_\pi[h(x_1, \dots, x_n)]$ is estimated with

$$\hat{\mu}_h = \frac{\sum_{j=1}^{m} h(x_1^{(j)}, \dots, x_n^{(j)}) \, w_n^{(j)}}{\sum_{j=1}^{m} w_n^{(j)}} \qquad (1)$$

for any integrable function $h$ of interest.

The critical choices that affect the effectiveness of the SMC method are: (1) the
approximating target distribution $\pi_t(x_1 \dots x_t)$, (2) the sampling distribution
$g_{t+1}(x_{t+1} \mid x_1 \dots x_t)$, and (3) the resampling scheme. In this study, we are interested in
sampling from the uniform distribution $\pi_n(x_1 \dots x_n)$ of all geometrically feasible conformations
of length n, which we call the final objective distribution. It can also be chosen to be the
Boltzmann distribution when an energy function such as that of the HP model is introduced.

The Rosenbluth method is a special case of SMC. Its target distribution
$\pi_t(x_1 \dots x_t)$ is the uniform distribution of all SAWs of length t. Its sampling distribution
$g_{t+1}(x_{t+1} \mid x_1 \dots x_t)$ is the uniform distribution among all $n_1(x_1, \dots, x_t)$ unoccupied
neighboring sites of the last monomer $x_t$, and the weight function is

$$w(x_1, \dots, x_t, x_{t+1}) = w(x_1, \dots, x_t) \, n_1(x_1, \dots, x_t).$$

When there are no unoccupied neighboring sites ($n_1(x_1, \dots, x_t) = 0$), there is no place to
put the (t+1)-th monomer. In this case, the chain runs into a dead end and we declare
the conformation dead, with its weight assigned to be 0. In the case of the Rosenbluth method, no
resampling is used.

Similarly, the k-step look-ahead algorithm chooses $\pi_{t+1}(x_1, \dots, x_{t+1})$ to be the
marginal distribution of $\pi^*_{t+k}(x_1, \dots, x_{t+k})$, the uniform distribution of all SAWs of length
t + k. Hence $\pi_{t+1}$ is closer to the final objective distribution, the uniform distribution of
all SAWs of length n. Specifically,

$$\pi_{t+1}(x_1, \dots, x_{t+1}) = \sum_{x_{t+2}, \dots, x_{t+k}} \pi^*_{t+k}(x_1, \dots, x_{t+1}, x_{t+2}, \dots, x_{t+k}) \propto n_k(x_1, \dots, x_{t+1}),$$

where $n_k(x_1, \dots, x_{t+1})$ is the total number of SAWs of length t + k "grown" from $(x_1, \dots, x_{t+1})$
(i.e., with the first (t + 1) positions at $(x_1, \dots, x_{t+1})$). In the k-step look-ahead algorithm,
the sampling distribution is

$$g_{t+1}(x_{t+1} = x \mid x_1, \dots, x_t) = \frac{n_k(x_1, \dots, x_t, x)}{n_{k+1}(x_1, \dots, x_t)}.$$

It chooses the next position according to what will happen k steps later. Namely, the
probability of placing the (t+1)-th monomer at $x$ is determined by the ratio of the total
number of SAWs of length t + k grown from $(x_1, \dots, x_t, x)$ and the total number of SAWs
of the same length t + k grown from one step earlier, $(x_1, \dots, x_t)$. The corresponding
incremental weight is

$$\frac{w(x_1, \dots, x_t, x_{t+1})}{w(x_1, \dots, x_t)} = \frac{n_k(x_1, \dots, x_{t+1}) / n_k(x_1, \dots, x_t)}{n_k(x_1, \dots, x_{t+1}) / n_{k+1}(x_1, \dots, x_t)} = \frac{n_{k+1}(x_1, \dots, x_t)}{n_k(x_1, \dots, x_t)}.$$

Although it has a higher computational cost, it usually produces better inference on the final
objective distribution, with fewer "dead" conformations. The standard Rosenbluth algorithm
is the 1-step look-ahead algorithm.

To compare geometric properties estimated from the sequential Monte Carlo method with
those obtained by exhaustive enumeration, we examine the expected number of voids and
expected void size for polymers of chain length 14 to 22. Figure ^ shows that sequential
Monte Carlo can provide very accurate estimation of these geometric properties of voids.
Here 2-step look-ahead is used, with a Monte Carlo sample size of 100,000 and no resampling.
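
The k-step look-ahead sampling distribution can be illustrated with a brute-force sketch that counts continuations by recursion. The function names `count_extensions` and `lookahead_distribution` are hypothetical, and the recursive counting is an assumption chosen for clarity (its cost grows exponentially with k); it is not the implementation used in this study. With k = 1 the sketch reduces to the uniform Rosenbluth choice.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def count_extensions(last, occupied, steps):
    """Number of self-avoiding ways to add `steps` more monomers after `last`."""
    if steps == 0:
        return 1
    total = 0
    x, y = last
    for dx, dy in MOVES:
        nxt = (x + dx, y + dy)
        if nxt not in occupied:
            occupied.add(nxt)
            total += count_extensions(nxt, occupied, steps - 1)
            occupied.remove(nxt)
    return total

def lookahead_distribution(chain, k):
    """Probability of each candidate site for the next monomer under
    k-step look-ahead. Returns {} if every continuation is dead."""
    occupied = set(chain)
    x, y = chain[-1]
    counts = {}
    for dx, dy in MOVES:
        site = (x + dx, y + dy)
        if site in occupied:
            continue
        occupied.add(site)
        # Continuations that place the next monomer at `site` and survive
        # k - 1 further steps: the numerator of the sampling distribution.
        counts[site] = count_extensions(site, occupied, k - 1)
        occupied.remove(site)
    total = sum(counts.values())   # the denominator: all surviving continuations
    if total == 0:
        return {}
    return {site: c / total for site, c in counts.items()}
```

In the test case below, the chain has wrapped around an empty site; 1-step look-ahead enters that dead-end site half of the time, while 2-step look-ahead never does.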

(Figure ^ here.) 

The resampling step is one of the key ingredients of SMC. There are many
cases where resampling is beneficial. First, note that it is unavoidable to have some dead
conformations during the growth. These chains need to be replaced to maintain a sufficient
Monte Carlo sample size. Second, the weight of some chains may become so small
that their contribution to the weighted average (1) is negligible. When the variance of the
weights is large, the effective Monte Carlo sample size becomes small. Third, for
a specific function $h$, its value may become too small (even zero) for some sampled
conformations. In all these cases, efficiency can be gained by replacing those conformations
with "better" ones. This procedure is called "resampling". There are many different ways
to do resampling. One approach is rejection control, which regenerates the replacement
conformations from scratch. An easier approach is to duplicate the existing good
conformations. Specifically,



Procedure Resampling

    // m: number of original samples.
    // {(x_1^(j), ..., x_t^(j)), w^(j)}, j = 1, ..., m: original properly weighted samples
    for j = 1 to m
        Set resampling probability of j-th conformation proportional to a^(j)
    endfor

    for *j = 1 to m
        Draw *j-th sample from original samples {(x_1^(j), ..., x_t^(j))}, j = 1, ..., m,
            with probabilities proportional to {a^(j)}, j = 1, ..., m
        // Each sample in the newly formed sample is assigned a new weight.
        // *j-th chain in new sample is a copy of k-th chain in original sample.
        w^(*j) <- w^(k) / a^(k)
    endfor



In the resampling step, the m new samples $\{(x_1^{(*j)}, \dots, x_t^{(*j)})\}_{j=1}^{m}$ can be obtained either
by residual sampling or by simple random sampling. In residual sampling, we first obtain
the normalized probability $\tilde{a}^{(j)} = a^{(j)} / \sum_{j'} a^{(j')}$. Then $\lfloor m \tilde{a}^{(j)} \rfloor$ copies of the j-th
sample are made deterministically for j = 1, ..., m. For the remaining
$m - \sum_{j} \lfloor m \tilde{a}^{(j)} \rfloor$ samples to be made, we
randomly sample from the original set with probability proportional to $m \tilde{a}^{(j)} - \lfloor m \tilde{a}^{(j)} \rfloor$.

The choice of resampling probability proportional to $a^{(j)}$ is problem specific. For a general
function $h$, such as the end-to-end extension, it is common to use $a^{(j)} = w^{(j)}$. In
this case, all the samples in the new set have equal weight. When the function is irregular,
a carefully chosen set of $a^{(j)}$ will increase the efficiency significantly.
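
The residual sampling step described above can be sketched as follows. The function name `residual_resample` is hypothetical, and the new-weight rule w/a is the standard one for this kind of resampling, taken here as an assumption; it reproduces the equal-weight property when a is chosen equal to the weights.

```python
import math
import random

def residual_resample(weights, a, rng):
    """Residual resampling of m weighted samples.
    weights: current weights w^(j); a: non-negative priority scores a^(j).
    Returns (indices, new_weights), where indices[i] is the original sample
    copied into slot i and new_weights[i] = w / a of that sample."""
    m = len(weights)
    total = sum(a)
    scaled = [m * aj / total for aj in a]          # m times the normalized score
    copies = [math.floor(s) for s in scaled]       # deterministic copies
    indices = [j for j, c in enumerate(copies) for _ in range(c)]
    remaining = m - len(indices)
    if remaining > 0:
        residuals = [s - c for s, c in zip(scaled, copies)]
        # Fill the remaining slots at random, proportional to the residuals.
        indices += rng.choices(range(m), weights=residuals, k=remaining)
    new_weights = [weights[j] / a[j] for j in indices]
    return indices, new_weights
```

With a = (0, 2, 1, 1) for four samples, the dead sample is dropped, the second sample is duplicated with each copy carrying half of its weight, and the other two are kept unchanged, with no randomness involved.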

The method of pruning and enriching of Grassberger is a special case of residual
sampling, with $a^{(j)} = 0$ for the k chains with zero weight (dead conformations), $a^{(j)} = 2$
for the top k chains with the largest weights, and $a^{(j)} = 1$ for the rest of the chains. Residual
sampling on this set of $a$ is completely deterministic. The resulting sample consists of two
copies of the top k conformations (each of them having half of its original weight) and
one copy of the middle m - 2k chains with their original weight. The k dead conformations
are removed.

In our study of the relationship between compactness and packing density, we use a more
flexible resampling method. Our focus is on the packing density among all conformations
within a certain range of compactness. In this case, our objective target distribution is the uniform
distribution among all possible SAWs with compactness falling within a certain
interval, i.e., a truncated distribution. Although compactness changes slowly as the chain
grows, the compactness of a chain may evolve and cover a wide range while it grows
into a long chain. Hence we choose the uniform distribution of all possible
SAWs of length t as our target distribution at t, and only select those with the desired
compactness at the end for our estimation of the packing density. In order to have a higher
number of usable samples (i.e., to achieve a better acceptance rate) at the end, we encourage
growth of chains with desirable compactness through resampling. Specifically,

Procedure Resampling (m, d, q)

    // m: Monte Carlo sample size, d: steps of looking-back.
    // q: targeting compactness.
    k <- number of dead conformations.
    Divide m - k samples randomly into k groups.
    for group i = 1 to k
        Find conformations not picked in previous d steps.
        // Pick the best conformation P_j, for example
        P_j <- polymer with min |rho - q|
        Replace one of k dead conformations with P_j
        Assign both copies of P_j half its original weight.
    endfor

Here d is used to maintain higher diversity for resampled conformations.

Most polymers sampled by sequential Monte Carlo without resampling are well-extended
with few voids, as shown in Figure |^(a) and (b). They have small compactness (less than
0.5) and large packing density. As a result, only a small number of samples whose compactness
falls within the desired interval of higher than 0.5 are accepted at the end. By using the
resampling step described above, we were able to generate more samples near the desired
compactness value of 0.6 (Figure |^). Figure |^ is a pure histogram of compactness in the
observed samples, without regard to the weights of the samples. Figure |^ shows that the
resampling technique is also very effective in shifting the samples to small packing density
values, hence improving the inferences.

(Figure here.) 

V. VOIDS DISTRIBUTION OF LONG CHAINS 

We apply the techniques of sequential Monte Carlo with resampling to study the
statistical geometry of voids in long chain polymers. Figure ^a shows that the probability of void
formation increases with the chain length. At chain length 105-110, about half of the
conformations contain voids. The expected number of voids (Figure §b) increases linearly with
chain length. Similar linear scaling behavior is also observed in proteins. The expected
wall size of voids and void size also increase with chain length (Figure ^jc and Figure

The expected packing density is found to decrease with chain length, which is consistent
with the scaling relationship of void size and chain length shown in Figure ^jc. The
compactness $\rho$ of chain polymers has been the subject of several studies [41, 42]. The asymptotic
value of $\rho$ we found is 0.18, slightly different from that reported in [41] ($\rho$ = 0.16), and
within the range of 0.16 - 0.24 reported in [42].

(Figure |^ here.) 

To explore the relationship of packing density $p$ and compactness $\rho$, we use sequential
Monte Carlo with 2-step look-ahead to sample 200,000 conformations, each with an appropriate
weight assigned. This is repeated 20 times, and the weighted average values of packing
density at various compactness values for chains with 60-100 monomers are plotted (Figure
The compactness value corresponding to the minimum packing density seems to have shifted
from 0.462 for the 22-mer by enumeration to above 0.5 for the 100-mer by sampling. However, the
overall pattern of $p$ and $\rho$ found by Monte Carlo is very similar to the pattern found by
enumeration for polymers with N < 22.

The accuracy of geometric properties of long chain polymers estimated by Monte Carlo
can be assessed by the variance obtained from multiple Monte Carlo runs.

(Figure |^ here.) 



VI. END EFFECTS OF VOID FORMATION 



What is the effect of void formation on the size of conformational space? We consider the
conformational reduction factor of voids. Following previous work, we define the conformational
reduction factor due to the constraint of a void as:

$$R(n; i, j) = \frac{\omega(n; i, j)}{\omega(n)},$$

where $\omega(n; i, j)$ is the number of conformations that contain a void beginning at monomer
$i$ and ending at monomer $j$, and $\omega(n)$ is the total number of conformations of n-polymers.
$R(n; i, j)$ reflects the restriction of conformational space due to the formation of a void with
wall interval of $k = |i - j|$. Figure 10a shows a 24-mer with one void that starts at i = 4 with
k = 19. Unlike self-contacts or self-loops, which were the subject of detailed studies by Chan and
Dill, all conformations analyzed here must contain a void. The polymer shown
in Figure 10 with a large loop has no void, and such polymers do not contribute to the
numerator of R.



(Figure 10 here.)



Figure 11a shows the reduction factor R calculated by enumeration for voids at different
starting positions with wall intervals k = 7, 9 and 11. There are clearly strong end-effects:
the reduction factor of voids of the same wall interval depends on where the void is located.
R decreases rapidly as the void moves from the end of the chain towards the middle. Void
formation is much more preferred at the end of the chain. Similar end effects of void formation
are also observed for the 55-mer sampled by sequential Monte Carlo (Figure 11b).
(Figure 11 here.)

The end-effect of voids has the same origin as the end-effect of self-contact, which has
been extensively studied by Chan and Dill. Because of the effect of excluded
volume, it is sterically less hindering to form a void at the end of a polymer. When a void
is formed, the conformational space of the k + 1 monomers between monomers i and j, as
well as that of the two tails, becomes restricted. When a void is formed at the chain end, only one
tail is subject to conformational restriction.

Void formation is different from self-contact. When monomers i and j form a self-contact, it
may involve the formation of a void, but it is also possible that there will be no unfilled space
between i and j. When a void is formed beginning at monomer i and ending at monomer
j, some monomers between i and j will have unsatisfied contact interactions. Compared to
non-bonded self-contact, the effect of conformation reduction is more pronounced for void
formation. For the two-dimensional lattice, the ratios between reduction factors of self-contact
at chain end and mid chain of a sufficiently long polymer are 1.3, 1.4, 1.5 and 1.6 for k = 3, 5, 7
and 9, respectively, whereas the ratios for voids at chain end and at the symmetric midpoint
of the N = 22 polymer are 3.4, 4.0, and 4.4 for k = 7, 9 and 11. The conformational reduction
factor R(i, j) for voids at various beginning positions i and various ending positions j can be
summarized in a two-dimensional contour plot as shown in Figure 11b.

We now consider the power-law dependence of $R(N; i, j)$ on the wall interval $k = |i - j|$.
In the studies of self-contacting loops by Chan and Dill, the scaling exponent $\nu$ of the
reduction factor R with loop length $k = |i - j|$, $R(N; i, j) \sim k^{-\nu}$, is found to depend
both on k and on the location of the cycle in the chain. The values of $\nu$ for self-contact range
from 1.6 when k = N to 2.4 when the loop is in the middle of a long chain with two
long tails. Because void formation involves at least 8 monomers, its scaling behavior is less
amenable to exhaustive enumeration, and application of Monte Carlo sampling is essential.
Based on estimations from Monte Carlo simulation of void formation in the 50-mer, $\nu$ ranges
from 1.4 ± 0.2 for $l_0 = 1$ to 3.0 ± 0.2 for void initiation position $l_0 = 8$ (Figure 11c). Our
results show that the scaling exponent of R with $k = |i - j|$ for void formation is similar to
that of self-contacting loops. This scaling exponent also depends on the location of the void.
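
One common way to extract such an exponent from a set of (k, R) estimates is a least-squares fit of log R against log k. The text does not state how the exponent was obtained, so the sketch below is only an assumed procedure, and the function name `fit_power_law_exponent` is hypothetical.

```python
import math

def fit_power_law_exponent(ks, rs):
    """Least-squares slope of log R versus log k.
    Returns nu such that R is approximately proportional to k**(-nu)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(r) for r in rs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx
```

Applied to synthetic data that follow an exact power law, the fit recovers the exponent used to generate them; with Monte Carlo estimates, the spread of the fitted value over independent runs gives an error bar.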



VII. CONCLUSION 

In this work, we have studied the statistical geometry of voids as topological features in
two-dimensional lattice chain polymers. We define voids as unfilled space fully contained
within the polymer, and have developed a simple algorithm for their detection. We have
explored the relationship of various statistical geometric properties with the chain length of
the polymer, including the probability of void formation $\pi_v$, the expected number of voids
$\bar{n}_v$, the expected void size $\bar{v}$, the expected wall size of voids $\bar{w}$, the packing density $p$, and the
expected compactness $\rho$. Our results show that for chains of 105-110 monomers or more, at least
half of the conformations contain a void. At about 150 monomers, there will be at least
one void expected in a polymer. The expected wall size scales linearly with the chain length,
and about 10% of the monomers participate in the formation of voids. We formalize the
concept of packing density for lattice polymers. We found that both the packing density
and compactness decrease with chain length. The asymptotic value of compactness $\rho$ is
estimated to be 0.18.

We have also characterized the relationship of packing density and compactness, two
parameters that have been used frequently for studying protein packing. Our results indicate
that packing density reaches its minimum values at compactness between 0.4 and 0.6. The entropic
effects of voids are studied by analyzing the conformational reduction factor R of void
formation. We found that there is a significant end-effect for void formation: the ratio of R at
chain end and at mid chain may be twice as large as that of the R factor for contact loops,
where the formation of voids is not required.

In this study, we have applied sequential Monte Carlo (SMC) sampling and resampling 
techniques to study the statistical geometry of voids. Sequential Monte Carlo sampling and 
resampling are essential for exploring the geometry of long chain polymers. This is a very 
general approach that allows the generation of an increased number of conformations with 
characteristics of interest. For example, we can replace dead conformations with existing 
conformations of highest weight, of highest compactness, or of smallest 
radius of gyration. Figure 12a shows the histogram of conformations of 100-mers at 
different compactness generated without resampling. Figure 12c shows the histograms of 
conformations when resampling by weight and resampling by compactness $\rho$ are used. Other 
resampling schemes are possible, e.g., resampling by radius of gyration or by packing density. 
During resampling, the number $k$ of dead conformations at each step of growth is identified, 
and these are replaced with conformations of interest from $k$ randomly divided groups. 
To maintain sample diversity, these conformations must not have been resampled in the 
previous 4 steps of the growth process. Both histograms where resampling is used deviate from that 
of Figure 12a. Resampling by weight shifts the peak of the conformations to below 0.2, and 
resampling by compactness turns the histogram bimodal. The latter produces many 
more conformations with compactness $\rho > 0.4$. 
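The resampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the dictionary layout of a conformation, the function name `resample_dead`, and in particular the rule that a duplicated conformation shares its weight equally with its copy are all assumptions introduced here (the weight split is one common way to keep weighted averages consistent after duplication).

```python
import random

def resample_dead(population, score, step, cooldown=4, rng=random):
    """Replace dead conformations with copies of high-scoring live ones.

    population: list of dicts with keys 'alive', 'weight', 'chain',
    'last_resampled' (hypothetical layout). score: function ranking
    conformations, e.g. by weight or by compactness.
    """
    dead = [c for c in population if not c["alive"]]
    live = [c for c in population if c["alive"]]
    k = len(dead)
    if k == 0 or not live:
        return population
    rng.shuffle(live)
    groups = [live[g::k] for g in range(k)]  # k randomly divided groups
    for slot, group in zip(dead, groups):
        # Skip conformations resampled within the previous `cooldown` steps.
        eligible = [c for c in group if step - c["last_resampled"] > cooldown]
        if not eligible:
            continue  # this dead slot stays dead for now
        donor = max(eligible, key=score)
        donor["weight"] /= 2.0  # assumption: donor and copy split the weight
        donor["last_resampled"] = step
        slot.update(alive=True, weight=donor["weight"],
                    chain=list(donor["chain"]), last_resampled=step)
    return population

def _conf(alive, weight):
    return {"alive": alive, "weight": weight,
            "chain": [(0, 0), (1, 0)], "last_resampled": -100}

pop = [_conf(True, 4.0), _conf(True, 2.0), _conf(False, 0.0), _conf(False, 0.0)]
total_before = sum(c["weight"] for c in pop if c["alive"])
resample_dead(pop, score=lambda c: c["weight"], step=10, rng=random.Random(0))
total_after = sum(c["weight"] for c in pop if c["alive"])
```

Passing a different `score` function (compactness, negative radius of gyration, negative packing density) changes which conformations are enriched without changing the rest of the procedure.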

SMC sampling and resampling use biased samples, since conformations are generated 
with probability different from that of the target distribution. The bias is dictated by the 
method of resampling and by the number of look-ahead steps chosen in 
sequential Monte Carlo. An essential component of successful biased Monte Carlo 
sampling is the assignment of an appropriate weight to each sample conformation. This is necessary 
because we need to estimate the expected values of parameters such as packing density 
and void size under the target distribution of all geometrically feasible conformations. In 
Figure 12a, where each of the 200,000 starting conformations is generated by two-step 
look-ahead without resampling, not every conformation is generated with the same probability, 
and each is therefore assigned a different weight accordingly. Figure 12b shows the weight-adjusted 
histogram, which is indicative of the probability density function at different compactness 
for the population of all geometrically feasible 100-mers. 
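The role of the weights can be illustrated with a small growth sampler. The sketch below uses plain one-step look-ahead (Rosenbluth-style) growth on the square lattice rather than the two-step look-ahead used in this study, and the names `grow_chain` and `weighted_mean` are assumptions made for the example. Each chain's weight is the product of the number of free neighbor sites at every growth step, which is the inverse of the probability with which that chain was generated.

```python
import random

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def grow_chain(n_monomers, rng):
    """Grow one self-avoiding chain; return (chain, weight).

    The weight is 0.0 if the chain gets trapped (a dead conformation).
    """
    chain = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    while len(chain) < n_monomers:
        x, y = chain[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in occupied]
        if not free:
            return chain, 0.0
        weight *= len(free)      # inverse of the sampling probability
        site = rng.choice(free)
        chain.append(site)
        occupied.add(site)
    return chain, weight

def weighted_mean(samples, h):
    """Estimate E[h] under the uniform distribution over conformations."""
    total = sum(w for _, w in samples)
    return sum(w * h(c) for c, w in samples) / total

rng = random.Random(1)
samples = [grow_chain(5, rng) for _ in range(20000)]
# The average weight estimates the number of 5-monomer self-avoiding
# chains from a fixed origin, counting every orientation separately.
mean_weight = sum(w for _, w in samples) / len(samples)
```

For 5 monomers the average weight converges to 100, the number of 4-step self-avoiding walks on the square lattice. If conformations related by rotation and reflection are identified, this corresponds to (100 - 4)/8 + 1 = 13 distinct conformations, which appears to agree with the n = 5 entry of Table 1.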

Figure 12d shows that when weights are incorporated and the area of the histogram is 
normalized to the final number of surviving conformations, the weighted distributions of 
conformations using different resampling techniques are in excellent agreement with the weighted 
distribution when no resampling is used (Figure 12b). This example shows that by 
incorporating weights, the target distributions can be faithfully recovered even when the sampling 
is very biased. 

(Figure 12 here.) 

Although sequential Monte Carlo sampling is very effective, the estimation of parameters 
associated with rare events remains difficult. In Figure 11, where the conformational reduction 
factor $R$ is plotted at various void initiation positions and wall interval lengths, voids starting 
at position 1 but with odd wall intervals ($k \in \{11, 13, 25\}$) are much rarer, and it 
is unlikely that sequential Monte Carlo sampling with limited sample size can provide a large 
enough effective sample size for the accurate estimation of the scaling parameter $\nu$, where 
$R(N; i, j) \sim k^{-\nu}$. 
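One widely used diagnostic for this problem is the effective sample size computed from the importance weights, ESS = (Σ w)² / Σ w². The sketch below uses that definition as an assumption; the text does not state which definition of effective sample size was used in this study, and the function name is hypothetical.

```python
def effective_sample_size(weights):
    """Effective sample size (sum w)^2 / sum w^2 of importance weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0:
        return 0.0
    return (total * total) / total_sq

ess_equal = effective_sample_size([1.0] * 10)            # all samples count
ess_skewed = effective_sample_size([1.0, 0.0, 0.0, 0.0])  # one sample dominates
```

When only a handful of sampled conformations contain the rare void of interest, the weights restricted to those conformations give a small effective sample size, and the resulting estimate of $\nu$ is correspondingly unreliable.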

In this study, we are interested in the statistics of void geometry, and our target 
distribution is the uniform distribution of all conformations of length n. With the introduction of 
an appropriate potential function and an alphabet of monomers, such as the HP model [34], [11], [35], 
we can study the thermodynamics, kinetics, and sequence degeneracy of chain polymers 
when voids are formed in polymers. In these cases, our target distributions will be chain 
polymers under the Boltzmann distribution derived from the corresponding potential 
functions. 

IX. APPENDIX 



To detect voids in a polymer, we use a simple search method. For an $l \times l$ lattice, we 
start from the lower-left corner. Once we find an unoccupied site $u$, we use the 
breadth-first-search (BFS) method to identify all other unoccupied sites that are connected to site 
$u$. These sites are grouped together and marked as "visited". Collectively they represent 
one void in the lattice. We continue this process until all unoccupied sites are marked as 
visited: 

Algorithm VoidDetection(lattice, l) 
    v = 0    // Number of voids 
    for i = 1 to l 
        for j = 1 to l 
            if site(i, j) is unoccupied and not visited 
                v ← v + 1 
                Mark site(i, j) as visited. 
                BreadthFirstSearch(lattice, (i, j)) 
                Update the size of void(i, j) 
            endif 
        endfor 
    endfor 

Details of BFS can be found in algorithm textbooks such as [42]. 
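The search above can be sketched in runnable form as follows. Two choices in this sketch are assumptions that the pseudocode does not spell out: empty sites are treated as connected only when they share an edge, and the scan runs over the bounding box of the chain padded by one site, so that any empty region reaching the padded border is treated as exterior space rather than as a void. The function name `detect_voids` is hypothetical.

```python
from collections import deque

def detect_voids(occupied):
    """Return the sizes of all enclosed empty regions.

    occupied: set of (x, y) lattice sites filled by monomers.
    """
    if not occupied:
        return []
    xs = [x for x, _ in occupied]
    ys = [y for _, y in occupied]
    x0, x1 = min(xs) - 1, max(xs) + 1   # padded bounding box
    y0, y1 = min(ys) - 1, max(ys) + 1
    visited = set()
    sizes = []
    for i in range(x0, x1 + 1):
        for j in range(y0, y1 + 1):
            start = (i, j)
            if start in occupied or start in visited:
                continue
            visited.add(start)
            queue = deque([start])
            size = 0
            touches_border = False
            while queue:                 # breadth-first search
                x, y = queue.popleft()
                size += 1
                if x in (x0, x1) or y in (y0, y1):
                    touches_border = True
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nb = (x + dx, y + dy)
                    if (x0 <= nb[0] <= x1 and y0 <= nb[1] <= y1
                            and nb not in occupied and nb not in visited):
                        visited.add(nb)
                        queue.append(nb)
            if not touches_border:       # enclosed region: a void
                sizes.append(size)
    return sizes

# An 8-mer wrapped around a single empty site.
ring = {(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)}
ring_voids = detect_voids(ring)
```

The length of the returned list is the number of voids, and its entries are the void sizes. Each site in the padded box is enqueued at most once, so the cost is linear in the area of the box.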






REFERENCES 



[1] F.M. Richards. Areas, volumes, packing, and protein structures. Ann. Rev. Biophys. Bioeng., 6:151-176, 1977. 
[2] C. Chothia. Structural invariants in protein folding. Nature, 254:304-308, 1975. 
[3] F.M. Richards and W.A. Lim. An analysis of packing in the protein folding problem. Q. Rev. Biophys., 26:423-498, 1994. 
[4] J. Liang and K.A. Dill. Are proteins well-packed? Biophys. J., 81:751-766, 2001. 
[5] B. Lorenz, I. Orgzall, and H.-O. Heuer. Universality and cluster structures in continuum models of percolation with two different radius distributions. J. Phys. A: Math. Gen., 26:4711-4722, 1993. 
[6] W.A. Lim and R. Sauer. Alternative packing arrangements in the hydrophobic core of λ repressor. Nature, 339:31-36, 1989. 
[7] D. Shortle, W.E. Stites, and A.K. Meeker. Contributions of the large hydrophobic amino acids to the stability of staphylococcal nuclease. Biochemistry, 29:8033-8041, 1990. 
[8] D.D. Axe, N.W. Foster, and A.R. Fersht. Active barnase variants with completely random hydrophobic cores. Proc. Natl. Acad. Sci. USA, 93:5590-5594, 1996. 
[9] R.A. Laskowski, N.M. Luscombe, M.B. Swindells, and J.M. Thornton. Protein clefts in molecular recognition and function. Protein Sci., 5:2438-2452, 1996. 
[10] J. Liang, H. Edelsbrunner, and C. Woodward. Anatomy of protein pockets and cavities: Measurement of binding site geometry and implications for ligand design. Protein Sci., 7:1884-1897, 1998. 
[11] K.F. Lau and K.A. Dill. A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules, 93:6737-6743, 1989. 
[12] H.S. Chan and K.A. Dill. Compact polymers. Macromolecules, 22:4559-4573, 1989. 
[13] K.A. Dill. Dominant forces in protein folding. Biochemistry, 29:7133-7155, 1990. 
[14] E. Shakhnovich and A. Gutin. Enumeration of all compact conformations of copolymers with random sequence of links. J. Chem. Phys., 93:5967-5971, 1990. 
[15] C.J. Camacho and D. Thirumalai. Kinetics and thermodynamics of folding in model proteins. Proc. Natl. Acad. Sci. USA, 90:6369-6372, 1993. 
[16] V.S. Pande, C. Joerg, A. Yu Grosberg, and T. Tanaka. Enumeration of the Hamiltonian walks on a cubic sublattice. J. Phys. A, 27:6231, 1994. 
[17] N.D. Socci and J.N. Onuchic. Folding kinetics of proteinlike heteropolymer. J. Chem. Phys., 101:1519-1528, 1994. 
[18] K.A. Dill, S. Bromberg, K. Yue, K.M. Fiebig, D.P. Yee, P.D. Thomas, and H.S. Chan. Principles of protein folding-a perspective from simple exact models. Protein Sci., 4:561-602, 1995. 
[19] A. Sali, E.I. Shakhnovich, and M. Karplus. How does a protein fold? Nature, 369:248-251, 1994. 
[20] I. Shrivastava, S. Vishveshwara, M. Cieplak, A. Maritan, and J.R. Banavar. Lattice model for rapidly folding protein-like heteropolymers. Proc. Natl. Acad. Sci. USA, 92:9206-9209, 1995. 
[21] D.K. Klimov and D. Thirumalai. Criterion that determines the foldability of proteins. Phys. Rev. Lett., 76:4070-4073, 1996. 
[22] R. Mélin, H. Li, N. Wingreen, and C. Tang. Designability, thermodynamic stability, and dynamics in protein folding: a lattice model study. J. Chem. Phys., 110:1252-1262, 1999. 
[23] H.S. Chan and K.A. Dill. The effects of internal constraints on the configurations of chain molecules. J. Chem. Phys., 92:3118-3135, 1990. 
[24] S. Govindarajan and R.A. Goldstein. Searching for foldable protein structures using optimized energy functions. Biopolymers, 36:43-51, 1995. 
[25] H. Li, R. Helling, C. Tang, and N. Wingreen. Emergence of preferred structures in a simple model of protein folding. Science, 273:666-?, 1996. 
[26] M.N. Rosenbluth and A.W. Rosenbluth. Monte Carlo calculation of the average extension of molecular chains. J. Chem. Phys., 23:356-359, 1955. 
[27] D. Frenkel and B. Smit. Understanding molecular simulation: From algorithms to applications. Academic Press, San Diego, 1996. 
[28] D.P. Landau and K. Binder. Monte Carlo simulations in statistical physics. Cambridge University Press, Cambridge, 2000. 
[29] S.W. Golomb. Polyominoes: Puzzles, patterns, problems, and packings. Princeton University Press, 1994. 
[30] F.M. Richards. The interpretation of protein structures: total volume, group volume distributions and packing density. J. Mol. Biol., 82:1-14, 1974. 
[31] H.S. Chan and K.A. Dill. Intrachain loops in polymers: Effects of excluded volume. J. Chem. Phys., 90:492-509, 1989. 
[32] J.S. Liu. Monte Carlo strategies in scientific computing. Springer, New York, 2001. 
[33] J.S. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems. J. Amer. Statist. Assoc., 93:1032-1044, 1998. 
[34] K.A. Dill. Theory for the folding and stability of globular proteins. Biochemistry, 24:1501, 1985. 
[35] H.S. Chan and K.A. Dill. Energy landscapes and the collapse dynamics of homopolymer. Journal of Chemical Physics, 97:12995-12997, 1993. 
[36] H. Meirovitch. A new method for simulation of real chains: Scanning future steps. J. Phys. A: Math. Gen., 15:L735-L741, 1982. 
[37] J.S. Liu and R. Chen. Blind deconvolution via sequential imputations. J. Amer. Statist. Assoc., 90:567-576, 1995. 
[38] A. Kong, J.S. Liu, and W.H. Wong. Sequential imputations and Bayesian missing data problems. J. Amer. Statist. Assoc., 89:278-288, 1994. 
[39] J.S. Liu, R. Chen, and W.H. Wong. Rejection control and importance sampling. J. Amer. Statist. Assoc., 93:1022-1031, 1998. 
[40] P. Grassberger. Pruned-enriched Rosenbluth method: Simulation of θ polymers of chain length up to 1,000,000. Phys. Rev. E, 56:3682-3693, 1997. 
[41] T. Ishinabe and Y. Chikahisa. Exact enumerations of self-avoiding lattice walks with different nearest-neighbor contacts. J. Chem. Phys., 85:1009-1017, 1986. 
[42] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to algorithms. The MIT Press, Cambridge, MA, 1990. 






TABLES 

TABLE 1. Number of conformations of an n-polymer with different numbers of voids on a square 
lattice. 

 n        ω(n)      ω_0(n)     ω_1(n)    ω_2(n)  ω_3(n)  ω_4(n) 
 3           2           2          0         0       0       0 
 4           5           5          0         0       0       0 
 5          13          13          0         0       0       0 
 6          36          36          0         0       0       0 
 7          98          98          0         0       0       0 
 8         272         270          2         0       0       0 
 9         740         734          6         0       0       0 
10        2034        1993         41         0       0       0 
11        5513        5393        120         0       0       0 
12       15037       14508        529         0       0       0 
13   [entries illegible in source] 
14      110188      104566       5602        20       0       0 
15      296806      280599      16088       119       0       0 
16      802075      748335      53149       591       0       0 
17     2155667     2002262     151052      2353       0       0 
18     5808335     5327888     470386     10051      10       0 
19    15582342    14222389    1325590     34287      76       0 
20    41889578    37784447    3973361    131298     472       0 
21   112212146   100673771   11119456    416239    2680       0 
22   301100754   267136710   32479871   1471874   12293       6 
23   805570061   710673806   90361878   4479355   54998      24 
24  2158326727  1883960171  259195774  14946910  223458     414 
25  5768299665  5005591512  717505892  44337381  862748    2132 



FIGURES 




FIG. 1. Voids of polymers in square lattice. The unfilled circle represents the first monomer. (a) A void 
of size 1 is formed in this 17-mer. (b) The two encircled monomers share a vertex but not an edge of a 
square and are not in topological contact. The unfilled space contained within the polymer is regarded as 
one connected void of size 4. 






FIG. 2. The only six conformations of 22-mer that contain 4 voids. 






FIG. 3. Geometric properties of chain polymers by exhaustive enumeration. (a) The probability of void 
formation, (b) the expected number of voids contained in a polymer, (c) the expected void size, and (d) the 
expected wall size of voids. All these parameters increase with chain length. 

FIG. 4. Voids of fixed size in polymers can have different shapes and thus sometimes different wall 
sizes. (a) The distribution of the number of observed different wall sizes for a void depends on the size of 
the void. Voids of size 5 have the maximum number of different wall sizes. (b) The expected wall size for 
voids of different size in 22-mer. 

FIG. 5. Packing density and compactness are two useful parameters describing packing of chain 
polymers. (a) The expected packing density decreases with chain length. (b) For 22-mer, the majority of the 
conformations with 1 void have high packing density, namely, the size of the void is small. Fewer conformations 
are found with large voids. The same pattern is observed for conformations with 2 and 3 voids. (c) The 
expected compactness fluctuates but in general decreases with chain length. (d) The relationship of average 
packing density $p$ and average compactness $\rho$ for chain polymers of length 14-22. Both maximally compact 
polymers ($\rho = 1$) and extended polymers ($\rho = 0$) have maximal packing density ($p = 1$), but polymers with 
low packing density have intermediate compactness on average. 

FIG. 6. Geometric properties obtained by enumeration and by Monte Carlo sampling for polymers 
of chain length 9-22. (a) The expected number of voids, and (b) the expected size of voids. Two-step 
look-ahead sequential Monte Carlo sampling is used, and the sample size is 100,000. These data show that 
geometric properties estimated by Monte Carlo are identical to those obtained by exhaustive enumeration. 

FIG. 7. The distribution of configurations of polymers obtained by the sequential Monte Carlo method 
can be adjusted by resampling. Sequential Monte Carlo with two-step look-ahead and without resampling does 
not generate enough compact conformations. (a) Histogram of conformations at different compactness 
generated without resampling. The compactness of the majority of the conformations is less than 0.5. (b) 
Histogram of conformations at different packing density generated without resampling. The majority of 
the conformations are more extended and have higher packing density. The number of conformations with 
packing density below 0.8 is small. (c) After applying a resampling technique favoring compactness of 0.6, the 
majority of the conformations have compactness between 0.5 and 0.6. Here resampling is applied at each 
sequential Monte Carlo growth step. (d) Resampling can also be applied to generate conformations with 
low packing densities with voids. Here resampling favoring low packing density is applied every 2 growth 
steps. A sample size of 200,000 is used in all calculations. 

FIG. 8. Geometric properties of lattice polymers of different lengths, estimated by the sequential Monte 
Carlo method with 2-step look-ahead and resampling. Each Monte Carlo simulation starts with 
a sample size of 200,000. Averaged values of twenty simulations are shown. (a) The probability of void 
formation increases with chain length. The standard deviation increases slowly with the length. At chain 
length 200, the standard deviation ($8.5 \times 10^{-3}$) is maximum. The expected number of voids (standard 
deviations < $1.6 \times 10^{-2}$) (b) and wall size (standard deviations < 0.25) (c) are linearly correlated with 
chain length. (d) The expected void size increases with chain length (standard deviations < $8.3 \times 10^{-3}$). 
(e) The expected packing density decreases with chain length (standard deviations < $7.5 \times 10^{-4}$). (f) The 
expected compactness decreases with chain length and reaches an asymptotic value of $\rho = 0.18$ (standard 
deviations < $5.7 \times 10^{-4}$). Different resampling strategies are applied, where dead conformations are removed 
and other conformations with the targeted property are duplicated. Resampling favors conformations with 
small radius of gyration in (a), (b), (c), (d), (e), and conformations with large weight in (f). Resampling is 
carried out every 5 steps in the process of chain growth. 

FIG. 9. The relationship of expected packing density and compactness for long chain polymers. These 
data are estimated by the sequential Monte Carlo method using 2-step look-ahead and a sample size of 
20 × 200,000 with resampling. Resampling is designed to favor compactness at specified values. The 
expected packing density calculated by averaging over the 20 runs has the largest standard deviations for 
the 100-mer, and these are shown in the figure. 

FIG. 10. The starting position of a void and its wall interval. (a) This 24-mer has a void that starts at 
$i = 4$ and ends at $j = 23$. Its wall interval is $k = |i - j| = 19$. (b) This polymer has a contact loop but contains no void. 

FIG. 11. The end-effect of void formation on conformational reduction. (a) Conformational reduction 
factor $R$ when voids are formed in a 22-chain, as examined by enumeration. $R$ depends on the starting 
position and the wall interval of the void. (b) Conformational reduction factor $R$, up to a normalizing constant, 
when voids are formed in a 50-chain, as sampled by sequential Monte Carlo (standard deviations < 6.2). 
(c) Scaling of conformational reduction factor $R$ and the wall interval $k$ for the 50-chain (standard deviations 
< 6.4). 

FIG. 12. Histograms of conformations of 100-mers generated by sequential Monte Carlo with and 
without resampling at different compactness. In (c) and (d), resampling is applied at every step of the chain 
growth process. All weighted histograms are normalized so that the total area equals the total number of 
surviving conformations reaching 100-mer. (a) Histogram of conformations at different compactness generated 
without resampling. (b) Weighted histogram of conformations generated without resampling, which is 
proportional to the distribution of all geometrically feasible 100-mers. (c) Histograms of conformations at 
different compactness when resampling is applied. To resample by weight, dead conformations are replaced 
with conformations of highest weight. To resample by compactness, dead conformations are replaced with 
conformations of lowest compactness. Note that the total number of surviving conformations that reach 
chain length of 100 is much higher than without resampling. Resampling by compactness generates many 
more conformations with higher compactness. (d) The weighted histograms of conformations under different 
resampling are in excellent agreement with each other and with that when no resampling is applied. 